This document aims to outline the data structure and summarize the available information from small-scale fishery WCS surveys starting from September 1995. The code and functions used in this analysis are part of an R package within a GitHub repository.
Data Structure
The dataset comprises small-scale fishery survey data collected since September 1995 across various landing sites or Beach Management Units (BMUs). The dataset merges two versions: one from the legacy database (September 1995 - October 2022) and another from the ongoing survey (from December 2022). The data are stored in a MongoDB database and are accessible through the peskas.kenya.data.pipeline package.
The dataset contains the following variables:
Code
data_table<-dplyr::tibble( N. =seq(1, length(names(valid_data))), Variable =names(valid_data), Type =c("Numeric", "Categorical", "Categorical", "Date", "Categorical", "Categorical","Numeric", "Numeric", "Numeric", "Numeric", "Categorical", "Categorical", "Text", "Numeric", "Numeric","Numeric", "Numeric"), Description =c("Data version identifier, 1 indicates legacy landings, 2 ongoing landings","Unique identifier for each survey submission","Unique identifier for each catch in the survey","Date of the fishing landing","Name of the site where landing occurs (BMU)","Name of the fishing ground","Latitude of the fishing site","Longitude of the fishing site","Number of fishers involved in the catch","Number of boats observed on the landing site","Specific type of fishing gear used","Category of fish caught","Size category of the fish caught","Weight of the catch in kilograms","Total weight of the catch in kilograms","Catch price in KES","Total price of the catch in KES"))reactable::reactable(data_table, striped =TRUE, highlight =TRUE, pagination =FALSE, columns =list("N."=reactable::colDef(maxWidth =100), Description =reactable::colDef(minWidth =240)), theme =reactable::reactableTheme( borderColor ="#dfe2e5", stripedColor ="#f6f8fa", highlightColor ="#f0f5f9", cellPadding ="8px 12px", style =list( fontFamily ="-apple-system, BlinkMacSystemFont, Segoe UI, Helvetica, Arial, sans-serif")))
Below is a breakdown of the categorical variables:
Code
ranked<-valid_data%>%dplyr::select(dplyr::where(is.character))%>%dplyr::select(-c("submission_id", "catch_id", "version", "fishing_ground", "size"))%>%tidyr::pivot_longer(dplyr::everything(), names_to ="variable", values_to ="value")%>%dplyr::count(variable, value)catch_names_minor<-ranked%>%dplyr::filter(variable=="fish_category")%>%dplyr::mutate( tot =sum(n), perc =n/tot*100)%>%dplyr::filter(perc<0.5)%>%dplyr::pull(value)plt<-ranked%>%dplyr::mutate(value =dplyr::case_when(variable=="fish_category"&value%in%catch_names_minor~"Other",TRUE~value))%>%ggplot(aes(x =tidytext::reorder_within(value, n, variable), y =n))+ggiraph::geom_bar_interactive(aes(fill =variable, tooltip =paste("Value:", value, "<br>Count:", n)), stat ="identity", alpha =0.5)+facet_wrap(~variable, scales ="free")+tidytext::scale_x_reordered()+coord_flip()+theme_minimal()+ggsci::scale_fill_jama()+theme( panel.grid =element_blank(), legend.position ="")+labs( title ="Ranking of categorical variables", x ="", y ="")ggiraph::girafe( ggobj =plt, options =list(ggiraph::opts_tooltip( opacity =0.8, use_fill =TRUE, use_stroke =FALSE, css ="padding:5pt;font-family: Open Sans;font-size:1rem;color:white")))
Preprocessing
The preprocessing operations were performed using the preprocess_legacy_landings and preprocess_landings functions, designed to harmonize two distinct data collection periods: the legacy database (up to October 2022) and the ongoing survey system (from December 2022).
Removed derived variables (e.g., Month, Year, Day) as they can be generated when needed
Stored static information related to landing sites in Google Sheets metadata tables
Converts data into a tidy format, where each column represents a single variable, each row corresponds to one observation, and all variables have the same unit of observation (For more information, see Tidying Data)
Prepares data for subsequent validation checks
A key challenge was harmonizing gear types between the two systems. The legacy database contained multiple gear classifications that needed consolidation:
Traditional fishing nets like prawnseine, jarife, sardinenet, and mosquitonet were grouped into a single “other_nets” category
- Similar methods were unified (e.g., “spear” and “harpoon” became “speargun”, while “longline” and “handline” were standardized as “handline”)
- Major gear types were preserved as distinct categories: gillnet, ringnet, beachseine, and setnet
The ongoing system introduced Swahili terms that needed to be matched with these established categories. For example, “Aina_ya_zana_ya_uvuvi” (fishing gear type) entries were mapped to standardized English terms following specific rules:
- “beachseine” → “speargun” (according to KoboToolbox metadata information)
- “net” → “gillnet”
- “handline” → “setnet”
- “ringnet” → “monofilament”
Fish categories also required standardization. The ongoing system uses Swahili terms with explicit size classifications:
“mchanganyiko” → “rest of catch” (for mixed catches)
The preprocessing system also handles catch reporting inconsistencies. When total catch values don’t match the sum of individual catches, the system:
Uses the sum of individual catches if there’s a mismatch
Trusts the total catch form if individual catches sum to zero but a total is reported
Maintains original values when no discrepancy exists
Validation
The validation process employs a systematic approach to identify and handle potential data quality issues. The process uses different alert codes to flag specific validation concerns.
Logical Checks
Before applying outlier detection, each record undergoes logical checks to spot fundamentally inconsistent data:
Fishers vs. Boats
A submission is flagged if the number of fishers (no_of_fishers) is less than the number of boats (n_boats).
Exception: Submissions with exactly 1 fisher and 1 boat are considered valid and are not flagged.
Gear Requiring Boats
Certain fishing gears (gillnet, handline, traps, monofilament, reefseine, ringnet, setnet) require at least one boat. If n_boats == 0 and the gear used is in this list, the submission is flagged.
Negative Catch
Any submission with a negative total catch (total_catch_kg < 0) is flagged.
Non-positive Effort
Submissions with zero or negative fishers (no_of_fishers <= 0) or negative boat values (n_boats < 0) are flagged.
Zero Catch With Non-zero Effort
Submissions where the total catch is zero (total_catch_kg == 0) while reporting non-zero effort (fishers or boats) are flagged.
Records flagged by any of these conditions are removed from the dataset prior to further validation and outlier screening.
Temporal Validation
The most basic validation step ensures landing dates are not earlier than 1990 (alert code 6). Records failing this check have their dates set to NA and are flagged for review.
Statistical Validation Approach
For numerical variables (number of fishers, boats, and catches), we implemented the Median Absolute Deviation (MAD) method to identify outliers. This approach provides a robust way to detect anomalies while accounting for natural variability in fisheries data. Specifically:
MAD flags outliers based on the data’s own distribution, rather than an arbitrary threshold.
By adjusting the value of k, analysts can control the sensitivity to outliers:
Lower k = more sensitive (flags more points)
Higher k = less sensitive (flags fewer points)
This method is well-suited for fisheries because:
It automatically adapts to distribution extremes.
It is robust against non-normal distributions.
It can be fine-tuned via k to match expected variability.
Sensitivity Parameters
Different k values were chosen based on the expected variability for each metric: - Catch data: k = 2 (more sensitive due to high natural variability)
- Number of fishers: k = 3 (less sensitive to accommodate different fishing operations)
- Number of boats: k = 3 (less sensitive to accommodate fleet variations)
The histogram below demonstrates how MAD identifies outliers using the example of boat counts (n_boats). The dashed lines show thresholds at 14.71 and 20.8 boats (k = 1.5 and k = 3, respectively). Lower k results in more stringent outlier detection, while higher k is more permissive.
Example Data
This histogram uses simulated data to illustrate MAD-based outlier detection. Although the numbers are artificial, the concept mirrors what is applied to real fisheries data.
Code
plt<-ggplot(fake_data, aes(n_boats))+theme_minimal()+geom_histogram(bins =30, fill ="#52817f", color ="white", alpha =0.5)+geom_vline(xintercept =upb1.5, color ="firebrick", linetype =2)+geom_vline(xintercept =upb3, color ="firebrick", linetype =2)+theme(panel.grid =element_blank())+annotate("text", x =upb1.5+0.5, y =Inf, label ="k value = 1.5", color ="firebrick", vjust =2, hjust =0)+annotate("text", x =upb3+0.5, y =Inf, label ="k value = 3", color ="firebrick", vjust =4, hjust =0)+scale_y_continuous(labels =scales::comma)+scale_x_continuous(n.breaks =10)+coord_cartesian(expand =FALSE)+labs( x ="Number of boats", y ="Count", title ="Outlier detection for number of boats using MAD")girafe( ggobj =plt, options =list(opts_tooltip( opacity =0.8, use_fill =TRUE, use_stroke =FALSE, css ="padding:5pt;font-family: Open Sans;font-size:1rem;color:white")))
Catch Weight Validation
Catch validation required special attention due to the hierarchical nature of the data. For catch weight data, the validation process is particularly sophisticated as it accounts for variations in catch sizes across different fishing methods. The data were grouped by gear type and fish category to account for these variations, allowing for a more accurate identification of outliers specific to each gear category and fish group.
The system implements:
Individual Catch Validation:
Computes separate thresholds for each gear type and fish category combination
Flags outliers with alert code 9
Accounts for gear-specific catch variations
Total Catch Validation:
Validates the total catch independently
Uses gear type and landing site as a grouping factor
Flags anomalies with alert code 10
Cross-checks with sum of individual catches
When an alert is triggered, the corresponding value is set to NA for analytical purposes while preserving the original value and alert code in the database for review.
Validation flags summary
Alert Code 1: Condition: The number of fishers is less than the number of boats, or the number of fishers equals the number of boats but is not 1. Purpose: Ensures that the number of fishers is logically sufficient to operate the reported number of boats. Captures cases where the data might indicate unrealistic or inconsistent entries.
Alert Code 2: Condition: The number of boats is 0 for gear types that require boats (e.g., “gillnet,” “handline,” “traps,” etc.). Purpose: Ensures that boat-dependent fishing methods report at least one boat. Flags missing or incorrect entries for boat counts when specific gear types are used.
Alert Code 3: Condition: The total catch weight is negative. Purpose: Ensures that the reported catch weight is valid, as negative values are impossible and typically indicate data entry errors.
Alert Code 4: Condition: The number of fishers is less than or equal to 0, or the number of boats is less than 0. Purpose: Ensures that both the number of fishers and boats are positive integers, identifying cases with invalid or missing data for these variables.
Alert Code 5: Condition: The total catch weight is 0, but the number of fishers or boats is greater than 0. Purpose: Flags cases where fishers or boats are reported but no catch is recorded. Indicates possible missing data or inconsistencies.
Alert Code 6: Condition: The landing_date is before 1990-01-01. Purpose: Ensures that the landing_date is valid and falls within the expected range. Invalid dates are flagged, and the corresponding entries are set to NA.
Alert Code 7: Condition: The number of fishers (no_of_fishers) is an outlier detected using statistical bounds calculated by univOutl::LocScaleB(). Purpose: Identifies unrealistic values in the number of fishers by applying robust outlier detection methods. Values exceeding the bounds are flagged and set to NA.
Alert Code 8: Condition: The number of boats (n_boats) is an outlier detected using statistical bounds calculated by univOutl::LocScaleB(). Purpose: Identifies unrealistic values in the number of boats using statistical thresholds. Outliers are flagged and set to NA.
Alert Code 9: Condition: The catch weight for individual fish groups (catch_kg) exceeds statistical bounds based on gear type and fish category, calculated by univOutl::LocScaleB(). Purpose: Detects outliers in individual catch records, ensuring consistency and accuracy in reported fish group catch weights.
Alert Code 10: Condition: The total catch weight (total_catch_kg) exceeds statistical bounds based on landing site and gear type, calculated by univOutl::LocScaleB(). Purpose: Identifies anomalies in total catch weight across landing sites and gear combinations. Outliers are flagged and set to NA.
Data Analysis
Temporal Distribution
The following plot shows survey activity across BMUs. Hover over tiles to view the number of surveys conducted in specific quarters and landing sites:
Code
quarter_format<-function(date){year<-lubridate::year(date)quarter<-lubridate::quarter(date)return(paste0("Q", quarter, "\n", year))}basedata<-valid_data%>%dplyr::mutate(landing_date =as.Date(as.POSIXct(landing_date, origin ="1970-01-01")))%>%dplyr::group_by(submission_id)%>%dplyr::summarise(dplyr::across(dplyr::everything(), ~dplyr::first(.x)))%>%dplyr::ungroup()%>%dplyr::mutate(quarter_date =lubridate::floor_date(landing_date, unit ="quarter"))%>%dplyr::group_by(quarter_date, landing_site)%>%dplyr::count()%>%dplyr::ungroup()%>%tidyr::complete(quarter_date, landing_site, fill =list(n =0))main_plot<-basedata%>%ggplot()+theme_bw()+geom_tile_interactive(aes( x =quarter_date, y =reorder(landing_site, n), fill =n, alpha =log(n), tooltip =paste("Quarter:", quarter_format(quarter_date), "<br>Landing Site:", landing_site, "<br>Surveys:", n)), color ="white", size =0.3)+scale_fill_viridis_c(direction =-1)+coord_cartesian(expand =FALSE)+theme( panel.grid =element_blank(), strip.background =element_rect(fill ="white"), legend.position ="bottom", legend.key.width =unit(1, "cm"), strip.text.y =element_text(angle =0, vjust =0.5, hjust =0.5, face ="bold"),)+guides(alpha ="none")+scale_x_date( date_breaks ="3 years", labels =quarter_format, expand =c(0, 0))+labs( title ="BMUs activity", subtitle ="Ranked temporal distribution of survey activities across BMU's, illustrating survey frequency over time.", x ="Quarter of the Year", fill ="Number of surveys", y ="")side_plot<-basedata%>%dplyr::group_by(landing_site)%>%dplyr::summarise(total_surveys =sum(n, na.rm =TRUE))%>%ggplot()+theme_minimal()+geom_bar_interactive(aes( y =reorder(landing_site, total_surveys), x =total_surveys, tooltip =paste("Landing site:", landing_site, "<br>Total surveys:", total_surveys)), stat ="identity", fill ="#52817f", alpha =0.5)+geom_text_interactive(aes( y =reorder(landing_site, total_surveys), x =total_surveys, label =scales::label_number(scale =1/1000, suffix ="K")(total_surveys), tooltip =paste("Total surveys:", total_surveys)), hjust =-0.1, size =3)+labs(x ="", y ="")+theme( panel.grid =element_blank(), axis.text.y =element_blank(), axis.text.x =element_blank(), axis.ticks.y =element_blank(), axis.title.y =element_blank())+coord_cartesian(x =c(0, 11000))combined_plot<-cowplot::plot_grid(main_plot,side_plot+theme(plot.margin =margin(0, 10, 0, -5)), ncol =2, rel_widths =c(4, 1), align ="h")girafe( ggobj =combined_plot, options =list(ggiraph::opts_tooltip( opacity =0.8, use_fill =TRUE, use_stroke =FALSE, css ="padding:5pt;font-family: Open Sans;font-size:1rem;color:white")))
Fishery Metrics
This section presents key metrics used to monitor and assess fishery dynamics across different Beach Management Units (BMUs). These standardized indicators allow for comparisons across areas and time periods while accounting for sampling effort, spatial variations, and temporal normalization.
The metrics are calculated at two main scales:
Sample Level: For each day at a specific BMU, metrics are computed by aggregating data from all landing events on that day. This involves summing the number of fishers and total catch, calculating the average catch per trip, and deriving normalized indicators such as effort (fishers per km²), CPUE (kg per fisher), and CPUA (kg per km²) based on the BMU’s area.
Monthly Aggregation: The sample-level metrics are further aggregated by BMU and month. This involves averaging daily measures over each month, yielding monthly mean values for trip catch, effort, CPUE, and CPUA.
bmus_selected<-export_data%>%dplyr::group_by(BMU)%>%dplyr::summarise(missing_cpue =mean(is.na(mean_cpue))*100)%>%dplyr::arrange(dplyr::desc(missing_cpue))%>%dplyr::filter(missing_cpue<75)%>%dplyr::pull(BMU)int_plot_cpue<-export_data%>%dplyr::filter(BMU%in%bmus_selected)%>%ggplot(aes(x =date, y =mean_cpue))+theme_minimal()+facet_wrap(~BMU, scales ="free_y", ncol =3)+ggiraph::geom_line_interactive(aes( group =1, tooltip =sprintf("Date: %s\nCPUE: %.2f",format(date, "%B %Y"),mean_cpue)), size =0.5, alpha =0.25, color ="#628395")+ggiraph::geom_point_interactive(aes( tooltip =sprintf("Date: %s\nCPUE: %.2f",format(date, "%B %Y"),mean_cpue)), size =0.5, alpha =0.5, color ="#628395")+scale_alpha_continuous( range =c(0.2, 0.8), guide ="none")+stat_smooth(method ="loess", col ="firebrick", alpha =0.25, size =0.5)+labs(x ="", y ="Mean CPUE (kg/fisher/day)")+theme( panel.grid =element_blank())ggiraph::girafe( ggobj =int_plot_cpue, options =list(ggiraph::opts_tooltip( opacity =0.8, use_fill =TRUE, use_stroke =FALSE, css ="padding:5pt;font-family: Open Sans;font-size:1rem;color:white")))
Catch Per Unit Area (CPUA) is calculated daily for each BMU by dividing the total catch for the day by the area of the BMU in square kilometers. These daily CPUA values are then averaged across all days within each month for the corresponding BMU:
int_plot_cpua<-export_data%>%dplyr::filter(BMU%in%bmus_selected)%>%ggplot(aes(x =date, y =mean_cpua))+theme_minimal()+facet_wrap(~BMU, scales ="free_y", ncol =3)+ggiraph::geom_line_interactive(aes( group =1, tooltip =sprintf("Date: %s\nCPUA: %.2f",format(date, "%B %Y"),mean_cpua)), size =0.5, alpha =0.25, color ="#628395")+ggiraph::geom_point_interactive(aes( tooltip =sprintf("Date: %s\nCPUA: %.2f",format(date, "%B %Y"),mean_cpua)), size =0.5, alpha =0.5, color ="#628395")+scale_alpha_continuous( range =c(0.2, 0.8), guide ="none")+stat_smooth(method ="loess", col ="firebrick", alpha =0.25, size =0.5)+labs(x ="", y ="Mean CPUA (kg/km²/day)")+theme( panel.grid =element_blank())ggiraph::girafe( ggobj =int_plot_cpua, options =list(ggiraph::opts_tooltip( opacity =0.8, use_fill =TRUE, use_stroke =FALSE, css ="padding:5pt;font-family: Open Sans;font-size:1rem;color:white")))
Fishing effort density is calculated daily for each BMU by dividing the number of fishers by the area of the BMU in square kilometers. These daily effort values are then averaged across all days within each month for the corresponding BMU:
int_plot_rpua<-export_data%>%dplyr::filter(BMU%in%bmus_selected)%>%ggplot(aes(x =date, y =mean_rpua))+theme_minimal()+facet_wrap(~BMU, scales ="free_y", ncol =3)+ggiraph::geom_line_interactive(aes( group =1, tooltip =sprintf("Date: %s\nRPUA: %.2f",format(date, "%B %Y"),mean_rpua)), size =0.5, alpha =0.25, color ="#628395")+ggiraph::geom_point_interactive(aes( tooltip =sprintf("Date: %s\nRPUA: %.2f",format(date, "%B %Y"),mean_rpua)), size =0.5, alpha =0.5, color ="#628395")+scale_alpha_continuous( range =c(0.2, 0.8), guide ="none")+stat_smooth(method ="loess", col ="firebrick", alpha =0.25, size =0.5)+labs(x ="", y ="Mean RPUA (KES/km²/day)")+theme( panel.grid =element_blank())ggiraph::girafe( ggobj =int_plot_rpua, options =list(ggiraph::opts_tooltip( opacity =0.8, use_fill =TRUE, use_stroke =FALSE, css ="padding:5pt;font-family: Open Sans;font-size:1rem;color:white")))
The combined metrics provide a comprehensive view of the fishery’s performance:
Catch Metrics:
CPUE (kg/fisher/day): Measures fishing efficiency and resource abundance
CPUA (kg/km²/day): Indicates spatial productivity of fishing grounds
Effort Metrics:
Effort (fishers/km²/day): Quantifies fishing pressure on the resource
Revenue Metrics:
RPUE (KES/fisher/day): Measures economic efficiency per fisher
RPUA (KES/km²/day): Indicates economic productivity per area
These standardized indicators enable: - Comparison of fishing performance across different BMUs - Assessment of temporal trends in resource use and economic returns - Evaluation of spatial variations in fishing pressure and productivity - Integration of ecological and economic dimensions of the fishery